Data Analysis Skill Test

Environment setup

Loading required packages:

In addition to the loaded packages, we will also call a few functions from the following libraries: naniar, lubridate, readxl, and Hmisc. We will also use an R Markdown theme from the rmdformats package. If you don't have these packages in your environment, you should uncomment and run the following code chunk:
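The chunk itself is not shown above; a sketch of what it would look like (the package list is taken from the text, and rmdformats is included for the theme):

```r
# Uncomment and run once if these packages are missing from your environment:
# install.packages(c("naniar", "lubridate", "readxl", "Hmisc", "rmdformats"))
```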

For prettier plots using ggplot2, let’s use a custom theme:
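A minimal sketch of such a theme; the specific settings below (bold titles, bottom legend, no minor grid) are illustrative assumptions, not the report's actual choices:

```r
library(ggplot2)

# Illustrative custom theme built on theme_minimal(); settings are assumptions
theme_tfp <- function(base_size = 12) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title       = element_text(face = "bold"),
      legend.position  = "bottom",
      panel.grid.minor = element_blank()
    )
}
```

It would then be applied with `ggplot(...) + geom_line() + theme_tfp()`.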

Case 1

Reading the dataset from the csv file:

Exercise 1.1

Perform an exploratory data analysis

By inspecting the dataset using the glimpse function, we see that it has 186 observations and 3 variables.

## Observations: 186
## Variables: 3
## $ isocode <chr> "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA"…
## $ year    <dbl> 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1…
## $ rtfpna  <dbl> 0.6171479, 0.6295884, 0.6384513, 0.6518582, 0.6461794, 0.6687…

The summary function shows that year ranges from 1950 to 2011 and rtfpna ranges from 0.62 to 1.38.

##    isocode               year          rtfpna      
##  Length:186         Min.   :1950   Min.   :0.6171  
##  Class :character   1st Qu.:1965   1st Qu.:0.8551  
##  Mode  :character   Median :1980   Median :0.9950  
##                     Mean   :1980   Mean   :0.9756  
##                     3rd Qu.:1996   3rd Qu.:1.0463  
##                     Max.   :2011   Max.   :1.3837

There are 62 observations for each country, one for every year between 1950 and 2011:

TFP dataset - Number of observations per country

isocode     n
--------  ---
CAN        62
MEX        62
USA        62

Now that we know the structure of the dataset, it is important to check for missing values. The naniar package provides some useful functions for spotting and handling missingness. The plot below shows that missing data is not an issue in the TFP dataset.
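The counts behind such a plot can also be checked directly; a base-R sketch on a toy stand-in for the dataset (illustrative values):

```r
# Toy stand-in for the TFP dataset (isocode, year, rtfpna)
tfp <- data.frame(
  isocode = c("USA", "CAN", "MEX"),
  year    = c(1950, 1950, 1950),
  rtfpna  = c(0.617, 0.801, 0.598)   # illustrative values
)

# naniar::gg_miss_var(tfp) draws the missingness plot; the underlying
# per-variable counts are simply:
miss_counts <- colSums(is.na(tfp))
miss_counts  # all zeros: no missing data
```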

Before moving on to data visualization, let's create three new features: TFP normalized to its 1955 value, first differences, and percentage changes:
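A base-R sketch of these three features on a toy two-country dataset (the report presumably uses dplyr's group_by/mutate; the values and column names below are illustrative):

```r
# Toy TFP data (assumed structure: isocode, year, rtfpna)
tfp <- data.frame(
  isocode = rep(c("USA", "CAN"), each = 4),
  year    = rep(1954:1957, 2),
  rtfpna  = c(0.64, 0.66, 0.67, 0.70, 0.80, 0.81, 0.83, 0.82)
)

# For each country: normalize to the 1955 value, take first differences,
# and compute percentage changes
tfp <- do.call(rbind, lapply(split(tfp, tfp$isocode), function(d) {
  d <- d[order(d$year), ]
  d$rtfpna_norm <- d$rtfpna / d$rtfpna[d$year == 1955]
  d$rtfpna_diff <- c(NA, diff(d$rtfpna))
  d$rtfpna_pct  <- 100 * d$rtfpna_diff / c(NA, head(d$rtfpna, -1))
  d
}))
```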

Since the dataset contains TFP series for three countries, a lineplot is the most obvious way to visualize the data. In order to follow the DRY (Don’t Repeat Yourself) programming principle, we will define a function for making lineplots with the TFP dataset.
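A sketch of what such a helper might look like; the signature and aesthetics are assumptions, and the report's interactive version likely wraps the result with something like plotly::ggplotly:

```r
library(ggplot2)

# Hypothetical helper: one line per country, y variable chosen by name
lineplot_tfp <- function(data, yvar, ylab = yvar) {
  ggplot(data, aes(x = year, y = .data[[yvar]], colour = isocode)) +
    geom_line() +
    labs(x = "Year", y = ylab, colour = "Country")
}

# Example call on a toy dataset:
tfp_toy <- data.frame(
  isocode = rep(c("USA", "CAN"), each = 3),
  year    = rep(1950:1952, 2),
  rtfpna  = c(0.62, 0.63, 0.64, 0.80, 0.81, 0.82)
)
p <- lineplot_tfp(tfp_toy, "rtfpna", ylab = "TFP")
```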

Now we can use our custom function to draw a simple interactive lineplot showing the evolution of TFP across the years:

The plot above shows that TFP increased somewhat steadily in the US between 1950 and 2011, while in Mexico it increased until the 1970s and decreased afterwards. The Canadian TFP remained fairly stable.

We can also use the lineplot_tfp function to plot the features we created. The evolution of TFP normalized to 1955 values makes the steady increase in American TFP even clearer.

In order to prevent overplotting, we will plot each country’s percentage changes in TFP separately. This will require a few additional steps besides calling lineplot_tfp.

Exercise 1.2

Forecast 10 years of the series (if you are performing the exercise in R, use package “forecast”).

The forecast package expects time-series objects, so we will build a list containing the TFP series for each country as separate xts objects:
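A comparable self-contained sketch using base-R annual ts objects instead of xts (the values are illustrative, not the real data):

```r
# One series per country, annual data starting in 1950
tfp_values <- list(
  USA = c(0.617, 0.630, 0.638, 0.652),   # illustrative values
  CAN = c(0.801, 0.812, 0.820, 0.825),
  MEX = c(0.598, 0.603, 0.610, 0.615)
)

# Convert each vector to an annual time series
tfp_ts_list <- lapply(tfp_values, ts, start = 1950, frequency = 1)
```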

To keep things simple, we will use the function auto.arima from the forecast package. This function selects the appropriate autoregressive integrated moving average (ARIMA) model given the time series at hand.
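The fit-then-forecast pattern looks as follows; to keep the sketch self-contained it uses base R's arima() with a fixed order on a simulated series, whereas forecast::auto.arima would choose the order automatically:

```r
set.seed(1)

# Simulated annual series standing in for one country's TFP (1950-2011)
y <- ts(cumsum(rnorm(62, mean = 0.01, sd = 0.02)) + 0.62, start = 1950)

# auto.arima(y) would pick the order; here ARIMA(0,1,0) is fixed for brevity
fit <- arima(y, order = c(0, 1, 0))

# 10-year-ahead point forecasts (2012-2021), as asked in the exercise
fc <- predict(fit, n.ahead = 10)
```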

Now we can draw plots showing the forecast of the most appropriate ARIMA model for each time series. The code below draws all the plots at once using purrr's map function and stores them in a list called forecast_plots. At the end, it also shows the plot for the USA.

For completeness, the plots for Mexico and Canada are presented below:

Exercise 1.3

Case 2

Reading the dataset and taking a first look at its variables:

## Observations: 117,965
## Variables: 8
## $ date    <date> 1997-01-01, 1997-01-01, 1997-01-01, 1997-01-01, 1997-01-01, …
## $ product <chr> "corn", "corn", "corn", "corn", "corn", "corn", "corn", "corn…
## $ state   <chr> "ES", "GO", "GO", "GO", "MG", "MS", "PE", "PE", "PR", "PR", "…
## $ country <chr> "United States", "Argentina", "Bolivia", "United States", "Ar…
## $ type    <chr> "Import", "Import", "Export", "Export", "Import", "Export", "…
## $ route   <chr> "Sea", "Ground", "Ground", "Sea", "Ground", "Ground", "Sea", …
## $ tons    <dbl> 44.045, 54.000, 0.200, 3.488, 27.000, 40.000, 6300.000, 2625.…
## $ usd     <dbl> 113029, 36720, 180, 5688, 18630, 38700, 847350, 324188, 17010…

It will be useful to have separate variables corresponding to the year and month from the date variable. Let’s create such features using the lubridate package:
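With lubridate this is year(date) and month(date); an equivalent base-R sketch:

```r
dates <- as.Date(c("1997-01-01", "1997-02-01", "2019-12-01"))

# lubridate::year(dates) / lubridate::month(dates) -- or, in base R:
year_feat  <- as.integer(format(dates, "%Y"))
month_feat <- as.integer(format(dates, "%m"))
```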

For prettier feature and product names in plots and tables, let’s remove underscores and capitalize the first letter.
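A base-R sketch of this cleanup (stringr's str_to_sentence would do the capitalization in one call):

```r
products <- c("soybean_meal", "soybean_oil", "soybeans")

pretty <- gsub("_", " ", products)                  # drop underscores
pretty <- paste0(toupper(substring(pretty, 1, 1)),  # capitalize first letter
                 substring(pretty, 2))
pretty  # "Soybean meal" "Soybean oil" "Soybeans"
```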

Exercise 2.1

Show the evolution of total monthly and total annual exports from Brazil (all states and to everywhere) of 'soybeans', 'soybean oil' and 'soybean meal'.

Filtering the dataframe and using prettier product names:

The plots below show the evolution of total yearly exports for the selected products, both in terms of volume (millions of tons) and value (billions of US dollars). We can see that soybeans exports have increased substantially since the early 2000s, while exports of the other two products remained fairly stable. In addition, both plots look reasonably similar to one another, in spite of some change in the relative prices of soybean oil and soybean meal after 2007.

Looking at monthly data, we notice that soybeans exports show a clear seasonal pattern:

We can further investigate the seasonality in soybeans exports using the month variable we created above. The plot below shows the average exports by month. We can see exports are higher between March and May, decreasing smoothly in the following months. This is explained by the harvest season.

Exercise 2.2

What are the 3 most important products exported by Brazil in the last 5 years?

Filtering the last 5 years and computing total exported volume and value for each product:
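A base-R sketch of this step on a toy version of the exports data (the report presumably uses dplyr's filter/group_by/summarise; the rows below are illustrative):

```r
# Toy export records (assumed columns: year, product, tons, usd)
exports <- data.frame(
  year    = c(2014, 2015, 2016, 2017, 2018, 2019),
  product = c("corn", "soybeans", "corn", "sugar", "soybeans", "corn"),
  tons    = c(10, 30, 12, 20, 35, 14),
  usd     = c(5, 40, 6, 18, 45, 7)
)

# Keep the last 5 years and total volume/value per product
recent <- exports[exports$year >= 2015, ]
totals <- aggregate(cbind(tons, usd) ~ product, data = recent, FUN = sum)
totals <- totals[order(-totals$usd), ]   # rank by total exported value
```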

The table below shows the 3 most important products exported by Brazil in terms of total value:

Brazilian exports - Most important products by total exported value

Product         Tons (millions)   USD (millions)
-------------  ----------------  ---------------
Soybeans                 327.62        123746.66
Sugar                    120.04         40947.32
Soybean meal              76.65         28413.61

In our view, total exported value is a better criterion than total exported volume for identifying the most important products. Nevertheless, for the sake of completeness, the table below shows the most important products ranked by total exported volume. We can see that soybean meal is no longer in the table, which now includes corn alongside soybeans and sugar.

Brazilian exports - Most important products by total exported volume

Product     Tons (millions)   USD (millions)
---------  ----------------  ---------------
Soybeans             327.62        123746.66
Corn                 151.42         25520.31
Sugar                120.04         40947.32

Going back to total exported value, the plot below shows each product's share of total exports (considering only the products provided in the dataset):

Exercise 2.3

What are the main routes through which Brazil has been exporting 'corn' in the last few years? Are there differences in the relative importance of routes depending on the product?

Main routes used in corn exports - Yearly shares from 2017 to 2019

Year   Route     Share
-----  -------  ------
2017   Sea       91.07
2017   River      6.42
2017   Ground     2.50
2018   Sea       90.52
2018   River      6.17
2018   Ground     3.05
2019   Sea       96.16
2019   Ground     3.20
2019   Other      0.46

The table above shows that sea is by far the most used route in Brazil for exporting corn. But the table alone is not very informative. Let’s build a plot to see the importance of different routes for corn exports across the years:

The figure below shows plots similar to the last one for every product in the dataset. We can see that “Sea” has been the most important route for most of the other products as well. The sole exception is wheat between 2005 and 2011: during this period, the main route for this product was “Ground”.

The sudden change in route usage for wheat exports deserves further investigation. Drawing a barplot that shows the total exported volume over the years (instead of the shares for each route), we can actually see that wheat exports were minimal until 2012 - precisely when sea became the main export route.

Exercise 2.4

Which countries have been the most important trade partners for Brazil in terms of ‘corn’ and ‘sugar’ in the last 3 years?

The table below shows the top-5 destinations of Brazilian corn and sugar in the last 3 years (2016 to 2019).

Brazilian exports of corn and sugar - Average yearly exports to most important trade partners - 2016 to 2019

Product  Rank  Country               Avg value per year  Avg volume per year  Share of product
                                     (millions of USD)   (millions of tons)   exports (2016-2019)
-------  ----  --------------------  ------------------  -------------------  -------------------
Corn        1  Iran                              564.47                 3.29                 0.17
Corn        2  Japan                             315.00                 1.94                 0.10
Corn        3  Vietnam                           310.21                 1.89                 0.10
Corn        4  Egypt                             275.01                 1.67                 0.09
Corn        5  Spain                             267.16                 1.65                 0.08
Sugar       1  Algeria                           435.28                 1.34                 0.10
Sugar       2  Bangladesh                        412.63                 1.27                 0.09
Sugar       3  India                             339.22                 1.02                 0.08
Sugar       4  United Arab Emirates              307.68                 0.91                 0.07
Sugar       5  Saudi Arabia                      276.09                 0.85                 0.06

Note that the table above shows the top-5 importers of each product considering the sum of each country's imports across the last 3 years. Thus, it is possible that other countries ranked top-5 in a particular year. The code below identifies all countries that ranked among the top-5 importers of Brazilian sugar or corn in any year between 2016 and 2019. This list will be useful for drawing some plots later on.

## $Sugar
## [1] "Bangladesh"           "India"                "Algeria"             
## [4] "United Arab Emirates" "Malaysia"             "Saudi Arabia"        
## [7] "Nigeria"              "China"               
## 
## $Corn
## [1] "Iran"        "Egypt"       "Japan"       "Spain"       "Vietnam"    
## [6] "Malaysia"    "South Korea"

The code below draws interactive plots showing how much each of these countries imported of sugar or corn, compared to total Brazilian exports of such products, from 1997 to 2019.

The plot showing the distribution of corn exports by destination is stored in the list top_importers_plots and we can see it by typing:

The same for the plot showing sugar exports:

Exercise 2.5

For each of the products in the dataset, show the 5 most important states in terms of exports.

In order to identify the most important states in terms of exports, we will use each state's total exported volume over the last 5 years. Using this strategy, we can build the table below, which shows detailed data for the selected states, for each of the products:

Brazilian States - 5 greatest exporters, per product - 2015 to 2019

Product        Rank  State  Share  Tons (millions,  Avg tons per     USD (millions,  Avg USD per
                                   2015-2019)       year (millions)  2015-2019)      year (millions)
-------------  ----  -----  -----  ---------------  ---------------  --------------  ---------------
Corn              1  MT      0.59            89.66            17.93        14959.41          2991.88
Corn              2  PR      0.11            17.21             3.44         2867.01           573.40
Corn              3  GO      0.10            15.87             3.17         2767.30           553.46
Corn              4  MS      0.07            10.29             2.06         1699.12           339.82
Corn              5  SP      0.03             4.69             0.94          815.90           163.18
Soybean meal      1  MT      0.34            26.10             5.22        10215.85          2043.17
Soybean meal      2  PR      0.22            16.59             3.32         5950.50          1190.10
Soybean meal      3  RS      0.16            12.33             2.47         4258.09           851.62
Soybean meal      4  GO      0.12             9.19             1.84         3419.01           683.80
Soybean meal      5  BA      0.07             5.17             1.03         1690.44           338.09
Soybean oil       1  PR      0.41             2.81             0.56         2005.29           401.06
Soybean oil       2  MT      0.20             1.37             0.27          980.99           196.20
Soybean oil       3  RS      0.18             1.25             0.25          865.87           173.17
Soybean oil       4  GO      0.11             0.79             0.16          570.31           114.06
Soybean oil       5  SC      0.04             0.29             0.06          218.71            43.74
Soybeans          1  MT      0.27            87.94            17.59        32985.07          6597.01
Soybeans          2  RS      0.18            57.55            11.51        21954.34          4390.87
Soybeans          3  PR      0.15            50.45            10.09        19094.17          3818.83
Soybeans          4  GO      0.07            22.48             4.50         8495.29          1699.06
Soybeans          5  MS      0.06            18.47             3.69         6962.46          1392.49
Sugar             1  SP      0.65            77.95            15.59        26342.19          5268.44
Sugar             2  MG      0.11            13.75             2.75         4623.48           924.70
Sugar             3  PR      0.10            12.38             2.48         4172.83           834.57
Sugar             4  AL      0.04             5.31             1.06         1848.88           369.78
Sugar             5  MS      0.04             4.45             0.89         1503.49           300.70
Wheat             1  RS      0.16             5.78             1.16         1117.42           223.48
Wheat             2  SP      0.15             5.28             1.06         1203.63           240.73
Wheat             3  CE      0.14             4.87             0.97         1005.56           201.11
Wheat             4  BA      0.10             3.40             0.68          703.98           140.80
Wheat             5  PR      0.09             3.20             0.64          646.41           129.28

For a more detailed analysis, we will build barplots showing the evolution of each product's exports over the years. The barplots will also depict the contribution of the 5 most important states mentioned in the previous table. As a first step towards building these plots, the code below builds a named list with the names of the top 5 states for each product.

Now we can use the information in top_states_list to modify the comex dataframe so that the state variable identifies only the 5 most important states with respect to each product's exports. All the other states are labeled as "Others".
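The relabelling itself is a one-liner per product; a sketch with a hypothetical top-5 list (forcats::fct_other offers the same behaviour for factors):

```r
states     <- c("MT", "PR", "GO", "SP", "RJ", "AM")
top_states <- c("MT", "PR", "GO", "MS", "SP")   # hypothetical top-5 for corn

# Keep top states as-is, collapse everything else into "Others"
state_label <- ifelse(states %in% top_states, states, "Others")
state_label  # "MT" "PR" "GO" "SP" "Others" "Others"
```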

With the modified dataframe, we can use the map function to draw all the plots at once and store them in a named list. At the end of the following code chunk we also show the plot concerning corn exports.

For completeness, the remaining plots are shown below.

Exercise 2.6

Now, we ask you to show your modelling skills. Feel free to use any type of modelling approach, but bear in mind that the modelling approach depends on the nature of your data, and so different models yield different estimates and forecasts. To help you out in this task we also provide you with a dataset of possible covariates (.xlsx). They all come from public sources (IMF, World Bank) and are presented in index number format. Question: What should be the total Brazilian soybeans, soybean_meal, and corn export forecasts, in tons, for the next 11 years (2020-2030)? We're mostly interested in the annual forecast.

Reading the covariates dataset and taking a look at what it contains:

## Observations: 52
## Variables: 13
## $ year               <dbl> 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 19…
## $ price_soybeans     <dbl> NA, 100.00000, 98.52551, 83.40619, 97.85174, 97.05…
## $ price_corn         <dbl> NA, 100.00000, 103.88831, 85.98643, 108.16806, 108…
## $ price_soybean_meal <dbl> NA, 100.00000, 99.80544, 86.59696, 98.66332, 84.71…
## $ gdp_china          <dbl> 100.0000, 107.9000, 113.4029, 123.6092, 136.9590, …
## $ gdp_iran           <dbl> 100.00000, 81.20000, 74.21680, 83.93920, 95.52281,…
## $ gpd_netherlands    <dbl> NA, 100.00000, 99.50000, 98.20650, 99.97422, 103.0…
## $ gdp_spain          <dbl> 100.0000, 101.2000, 100.7952, 102.0047, 103.7388, …
## $ gdp_thailand       <dbl> 100.0000, 104.6000, 110.7714, 116.7531, 123.2912, …
## $ gdp_world          <dbl> 100.0000, 102.1000, 104.0399, 104.6641, 107.5947, …
## $ gdp_egypt          <dbl> 100.0000, 103.4000, 105.6748, 113.3891, 123.4807, …
## $ gdp_japan          <dbl> 100.0000, 103.2000, 107.5344, 111.0830, 114.9709, …
## $ gdp_vietnam        <dbl> 100.0000, 96.5000, 102.0970, 110.4690, 118.3122, 1…

We can see some missing data. Let’s investigate this further:

Only a few features have missing data, and only in the first periods of the series. It seems safe to simply drop the observations with incomplete data.

ARIMA

We will try some simple ARIMA models using the forecast package. Before fitting the model, we have to convert the export series into time series objects.

For our initial modelling, we will consider the following features for each product. Notice that for each product we are only using its own price, as well as the GDP of its main importers.

Now that we have stored the names of the selected covariates in a list, we can use it to subset our covariates dataset and build matrices with the regressors for each model. Notice that we are also splitting the covariate series into past data (xreg_list) and data that will be used only for prediction (xreg_list_pred).
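A base-R sketch of this split with a toy covariates table (illustrative values; the report keeps one such matrix pair per product):

```r
years <- 1980:2030
covariates <- data.frame(
  year       = years,
  price_corn = 100 + seq_along(years),       # illustrative index values
  gdp_world  = 100 + 2 * seq_along(years)
)

# Regressors for fitting (historical years) vs. regressors used only when
# forecasting 2020-2030 -- matrices as expected by an ARIMA xreg argument
xreg      <- as.matrix(covariates[covariates$year <= 2019, -1])
xreg_pred <- as.matrix(covariates[covariates$year >= 2020, -1])
```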

The data is now in the right shape, so we can fit our ARIMA models:

Let's see what each model's predictions look like:

Elastic Net

As an additional exercise, we will also fit Elastic Net models using many lagged covariates. The Elastic Net combines Ridge and Lasso regularization, and it is sometimes a good way of handling feature selection and multicollinearity.

Let’s build lists similar to the ones we used before, but now with simple dataframes (not xts objects):

## # A tibble: 23 x 2
##     Year million_tons
##    <dbl>        <dbl>
##  1  1997        0.739
##  2  1998        1.38 
##  3  1999        0.706
##  4  2000        1.54 
##  5  2001        5.93 
##  6  2002        3.06 
##  7  2003        4.29 
##  8  2004        5.22 
##  9  2005        1.64 
## 10  2006        4.89 
## # … with 13 more rows

Including all the covariates in the datasets inside data_elnet:

## Observations: 34
## Variables: 14
## $ year               <date> 1997-01-01, 1998-01-01, 1999-01-01, 2000-01-01, 2…
## $ million_tons       <dbl> 0.7394639, 1.3839176, 0.7062755, 1.5438019, 5.9321…
## $ price_soybeans     <dbl> 105.76163, 84.16003, 65.92870, 68.99383, 63.60515,…
## $ price_corn         <dbl> 93.20418, 80.83137, 71.82394, 70.17352, 71.27912, …
## $ price_soybean_meal <dbl> 124.01265, 77.42172, 67.60175, 82.93492, 80.02914,…
## $ gdp_china          <dbl> 560.7227, 604.4591, 651.0025, 706.3377, 765.6701, …
## $ gdp_iran           <dbl> 106.9763, 109.0088, 109.3359, 116.8800, 117.8151, …
## $ gpd_netherlands    <dbl> 152.7382, 159.9169, 167.9127, 174.9651, 178.9893, …
## $ gdp_spain          <dbl> 156.7281, 163.7808, 171.4785, 180.2240, 187.2527, …
## $ gdp_thailand       <dbl> 345.1241, 318.8946, 333.5638, 348.5742, 360.4257, …
## $ gdp_world          <dbl> 173.9366, 178.4590, 184.8835, 193.7579, 198.6018, …
## $ gdp_egypt          <dbl> 218.6837, 235.0850, 249.4251, 262.8941, 272.0954, …
## $ gdp_japan          <dbl> 180.1338, 178.1523, 177.6178, 182.5911, 183.3215, …
## $ gdp_vietnam        <dbl> 299.0952, 316.4427, 331.6320, 354.1829, 378.6216, …

Custom function to add lagged variables to a dataframe:
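A sketch of such a helper (the name and signature are assumptions): it appends lag-1 through lag-k copies of every column except the key:

```r
add_lags <- function(df, k, exclude = "year") {
  vars <- setdiff(names(df), exclude)
  for (l in seq_len(k)) {
    for (v in vars) {
      # shift the column down by l positions, padding the top with NA
      df[[paste0(v, "_lag", l)]] <- c(rep(NA, l), head(df[[v]], -l))
    }
  }
  df
}

# Example: two lags of x
d <- add_lags(data.frame(year = 1:4, x = c(10, 20, 30, 40)), k = 2)
```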

The full set of covariates (with lags) will be the same for every product. Feature selection will be handled by regularization. We will use a specific dataframe from data_elnet to build the lags, but the resulting dataframe will be used as covariates in the models for all three products.

Setting apart the covariates corresponding to the years we want to forecast:

Filtering the dataframe containing all lags so that it has only historical data and building the covariate matrix:

Since the response variable (what we want to predict) depends on the product, we will build a list to hold the series for the three products:

We will train the models using data from only the first years, leaving the last 5 years of observed data for model assessment and hyperparameter tuning:
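For annual data, this split is just a matter of cutting the ordered series at the right year; a sketch with a simulated series:

```r
set.seed(42)

# Stand-in for 23 annual observations (1997-2019)
y <- ts(rnorm(23), start = 1997)

# Last 5 years (2015-2019) held out for assessment and tuning
y_train <- window(y, end = 2014)
y_test  <- window(y, start = 2015)
```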

We will try multiple combinations of lambda and alpha (the Elastic Net hyperparameters).
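One grid consistent with the values visible in the corn results table (lambda log-spaced from 0.01 upward, alpha from 0 = Ridge to 1 = Lasso); the exact grid is an assumption, but 25 lambdas x 11 alphas yields the 275 model rows shown below:

```r
grid <- expand.grid(
  lambda = 10^seq(-2, 1, length.out = 25),  # 0.01 ... 10, log-spaced (assumed)
  alpha  = seq(0, 1, by = 0.1)              # 0 = Ridge, 1 = Lasso
)
nrow(grid)  # 275 hyperparameter combinations per product
```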

The chunk below does many things at once: (i) it trains a glmnet model for every combination of alpha and lambda, for each of the three products; (ii) it uses the trained models to make predictions on the held-out data (last 5 years); (iii) it computes the RMSE on the held-out data; (iv) it uses x.new to forecast exports for 2020-2030; and (v) it checks how many coefficients were non-zero in each model (i.e., how many features the Elastic Net selected). Every step is stored in a nested dataframe, which is then split by product. The dataframe with the results for corn is shown at the end.

## # A tibble: 275 x 8
##    lambda alpha product models  predictions        rmse forecast  non_zero_coefs
##     <dbl> <dbl> <chr>   <list>  <list>            <dbl> <list>             <dbl>
##  1  1       1   Corn    <elnet> <dbl[,1] [5 × 1]> 0.206 <dbl [11…              3
##  2  0.316   0.3 Corn    <elnet> <dbl[,1] [5 × 1]> 0.428 <dbl [11…              8
##  3  1       0.9 Corn    <elnet> <dbl[,1] [5 × 1]> 0.497 <dbl [11…              6
##  4  0.316   0.2 Corn    <elnet> <dbl[,1] [5 × 1]> 0.626 <dbl [11…             10
##  5  0.316   0.1 Corn    <elnet> <dbl[,1] [5 × 1]> 0.700 <dbl [11…             15
##  6  0.01    0.1 Corn    <elnet> <dbl[,1] [5 × 1]> 0.778 <dbl [11…             26
##  7  3.16    0   Corn    <elnet> <dbl[,1] [5 × 1]> 0.943 <dbl [11…             51
##  8  1       0.2 Corn    <elnet> <dbl[,1] [5 × 1]> 1.04  <dbl [11…             15
##  9  0.1     0.1 Corn    <elnet> <dbl[,1] [5 × 1]> 1.05  <dbl [11…             21
## 10  1       0   Corn    <elnet> <dbl[,1] [5 × 1]> 1.10  <dbl [11…             47
## # … with 265 more rows

We can see that the best-performing model on the held-out data used only 3 features. Considering that the RMSE was quite low, and that the series are relatively short and the held-out set is small, it is likely that the Elastic Net overfitted. To guard against this, we would need a stronger cross-validation strategy, perhaps using several (sequential) held-out sets. Once those held-out sets had been used for model selection and hyperparameter tuning, we could retrain the best-performing model on the entire dataset. If an "honest" assessment of model performance matters more than squeezing out the last drop of performance, we could also use both validation and test sets, leaving the latter untouched until the final model is chosen, tuned and retrained on the full train + validation data. Gathering more data and experimenting with multi-step-ahead forecasting would also help.

In any case, we will plot the results of the best-performing Elastic Net for forecasting corn exports, just to get a feel for the model's predictions.

André Luís Menegatti

7/6/2020